Purpose &
Project Metadata

This project explores the Medicaid Drug Rebate Program (MDRP) database via API calls (description here):

  1. Data Dictionary (Medicaid.gov)
  2. MDRP Data
  3. openFDA Drug Data (used to augment MDRP data)

Required LIbraries

CRAN GitHub
  • purrr
  • jsonlite
  • httr
  • summarytools
  • munsell
  • cachem
  • SmartEDA
  • htmltools
  • slider
  • stringi
  • magrittr
  • plotly
  • DT
  • data.table
  • pdftools
  • lubridate
  • future
  • furrr

Data
Wrangling

Retrieve and Prepare Data

The data were retrieved via R package httr with some initial conversion to data.table objects. Core objects were cached to disk (cachem) for easy retrieval after the initial pull.

Data Sets

MDRP Data

if (!"api_data" %in% .cache$keys()){ 
  download_temp <- tempfile()
  
  as.character(urls$data) |> 
    stri_extract_all_regex("http.+csv", simplify = TRUE) |> 
    as.vector() |>
    download.file(destfile = download_temp) 
    
  .tmp_obj <- read.csv(download_temp) |> as.data.table(na.rm = FALSE) 
}

if (!"api_data" %in% ls()){ 
  makeActiveBinding("api_data", function(){ .cache$get("api_data") }, env = globalenv())
}

Formatting updates include the following:

  • Convert field ‘NDC’ and fields ending in ‘Code’ to characters (numeral-encoded nominal values)
  • Converting date fields to date format
  • Replace ‘.’ in field names with ’ ’

Dictionary

if (!"api_dictionary" %in% .cache$keys()){ 
  .cache$set("api_dictionary", invisible( 
    as.character(urls$data) |> 
    stri_extract_all_regex("http.+pdf", simplify = TRUE) |> 
    as.vector() |>
    GET() |>
    content() |> 
    pdf_text()))
}  

.summary_labels <- invisible({
  .pattern <- c("Pkg"
                , "Intro"
                , "COD Status"
                , "FDA Application Number"
                , "FDA Therapeutic Equivalence Code"
                );
  .replacement <- c("Package"
                , "Intro.+Date"
                , "Covered Outpatient Drug [(]COD[)] Status"
                , "FDA Application Number/OTC Monograph Number"
                , "TEC"
                );
  
  names(api_data) |>
    rlang::set_names() |>
    imap_chr(\(x, y){
      .out <- .cache$get("api_dictionary") |> 
        stri_extract_all_regex(
          sprintf(
            fmt = "(%s)[:]\n.+"
            , stri_replace_all_fixed(str = x, pattern = .pattern , replacement = .replacement, vectorize_all = FALSE)
            )
        , simplify = TRUE
        ) |>
        stats::na.omit() |>
        as.vector() |> paste(collapse = "\n")
      
      if (rlang::is_empty(.out)){ 
        y 
      } else{ 
        .out <- paste(.out, collapse = "\n")
        ifelse(stringi::stri_length(.out) > 50, paste0(stri_sub(.out, length = 50), " ..."), .out)
      }
    })
})

.tmp_obj <- api_data;

iwalk(.summary_labels, \(x, y){ 
  .tmp_obj <<- modify_at(.tmp_obj, y, \(i){ attr(i, "label") <- x; i }) 
})

.cache$set("api_data", .tmp_obj)

OpenFDA Data

if (!"open_fda_ndc" %in% .cache$keys()){
  json.file <- paste0(params$data_dir, "/drug-ndc-0001-of-0001.json");
  download.file <- tempfile();
  
  if (!file.exists(json.file)){ 
    tags$p(sprintf("Retrieve data from '%s'", urls$openFDA)) |> print()
    
    GET(urls$data$children |> stri_extract_first_regex("https.+json.zip"),
      write_disk(path = download.file, overwrite = TRUE));
    
    unzip(zipfile = download.file, exdir = "data")
  }
      
  .cache$set("open_fda_ndc", { read_json(path = json.file) %$% {
    map(results, as.data.table) |> rbindlist(fill = TRUE) |>  
      setattr("metadata", meta)}
    })
}

if (!"openFDA_ndc" %in% ls()){ 
    makeActiveBinding("openFDA_ndc", function(){ .cache$get("open_fda_ndc")}, env = environment())
  }

NDC Format Inspection

NDC sequences come in a various formats, usually a 4-4-x, 5-4-x, or 5-3-x sequence (each integer indicating string length). Sometimes other formats arise, so normalizing all NDC sequences is a good idea, especially when there is a desire (or need) to join different data containing intersecting NDCs.

The following shows proportional representation of NDC formats in the OpenFDA and MDRP data, respectively:

The MDRP has many more NDC sequences due to truncation of leading zeroes. Fortunately, an NDC sequence is a collection of code segments (present in the data) concatenated with a hyphen. Knowing this, A function (check_ndc_format()) was created in order to derive conformed NDC segment sequences (using labeler and product codes) based on the OpenFDA sequences, allowing the MDRP and OpenFDA data to be joined later in the process.

check_ndc_format()

check_ndc_format <- \(lc, pc){ 
  pc <- modify_if(
          pc, \(i) stri_length(i) < 3
          , \(i) stri_pad_left(i, width = 3, pad = "0")
          )
  lc <- modify_if(
          lc
          , \(i) stri_length(i) < 4
          , \(i) stri_pad_left(i, width = ifelse(stri_length(pc) == 3, 5, 4), pad = "0")
          )
  paste(lc, pc, sep = "-")
}

Master Drug
Data

The OpenFDA and MDRP data were joined using the conformed NDC sequence in the previous subsection to create master_drug_data sampled below (Note: due to the size of the data, the join operation takes some time which is why the result is disk-cached for later retrieval. This caching approach is used for data retrieved or created during the data wrangling phase.):

if (params$refresh || !"master_drug_data" %in% .cache$keys()){
  .cache$set("master_drug_data", (\(x, i){
    x[i
      , on = "alt_ndc==product_ndc"
      , allow.cartesian = TRUE
      , `:=`(pharm_class = pharm_class
             , dea_schedule = dea_schedule
             , product_type = product_type
             , route = route
             , marketing_category = marketing_category
             )
      , by = .EACHI
      ][
      , `:=`(
        pharm_class = map_chr(pharm_class, \(x) unlist(x) %||% "~")
        , route = map_chr(route, \(x) unlist(x) %||% "~")
        )
      ]
    })(
    api_data %>% 
      setnames(stri_replace_all_fixed(names(.) |> tolower(), " ", "_")) %>% 
      .[, alt_ndc := map2_chr(labeler_code, product_code, check_ndc_format)]
    , openFDA_ndc
    ))
}
if (!"master_drug_data" %in% ls()){ 
  makeActiveBinding("master_drug_data", function(){ .cache$get("master_drug_data") }, env = globalenv());
}

Event Metrics

master_drug_data is a great data set for constructing simple, time-based metrics. Given the natural order of the types of events, it is easy to setup event sequence metrics using package lubridate. The metrics to be created are described below:

Metric Name Description
days_to_market Days between approval and market release
on_market_age Days active on market
days_market_absent Days most-recently absent from market

NDC Events
Creation

My next task was to add date-differential metrics mentioned above. As an intermediate object, I created ndc_events by looking at what appears to the be natural chronology of dates: fda_approval_date -> market_date -> termination_date -> reactivation_date.

Some of the values in columns termination_date and reactivation_date are NA indicating the event did not happen. This would obviously need to be addressed in deriving the metrics logic, and after several rounds of trial-and-error, I worked out such logic, discovering the following in the process:

  • The metrics are hierarchically-contingent based on whether or not NA values exist and if so, if which of ,, or both
  • Some values in termination_date and reactivation_date are future-dated relative to “today”: these were converted to NA before deriving the metrics as they haven’t happened yet (NA \(\equiv\) ‘Didn’t happen (yet)’)
  • A small subset of observations having the FDA approval date after the listed market date

The resulting object was captured in ndc_events_clean:

ndc_events_clean <- { 
  define(
    ndc_events
    , modify_at(.SD, c("termination_date", "reactivation_date"), \(x) ifelse(today() < x, NA, x))
    , cbind(
        .SD
        , define({
            .SD[, fda_approval_date:reactivation_date][, map(.SD, as.numeric)] |> 
              # dplyr::slice_sample(prop = 0.4) |>
              apply(1, \(x){
                c(x, diff(x) |> modify_if(is.na, \(i) 0) |> sign() %>% .[-1]) |> 
                  as.list() |>
                  modify_at(c(5, 6), \(i) i == 1) %>% 
                  rlang::set_names(names(.)[c(1:4)], paste0(names(.)[c(5, 6)], ".bool"))
                }, simplify = FALSE) |>
              rbindlist()
            }
          , days_to_market = market_date - fda_approval_date
          , on_market_age = 
              apply(.SD[, .(termination_date.bool, reactivation_date.bool, termination_date)]
                    , 1, function(i){ ifelse(i[[1]], ifelse(i[[2]], today(), i[[3]]), today()) }) -
              apply(.SD[, .(termination_date.bool, reactivation_date.bool, market_date, reactivation_date)]
                    , 1, function(i){ ifelse(i[[1]], ifelse(i[[2]], i[[4]], i[[3]]), i[[3]]) })
          , days_market_absent = 
              apply(.SD[, .(reactivation_date.bool, reactivation_date)]
                    , 1, function(i){ ifelse(i[[1]], i[[2]], today()) }) -
                apply(.SD[, .(termination_date.bool, termination_date)]
                    , 1, function(i){ ifelse(i[[1]], i[[2]], today()) })
          , ~days_to_market + on_market_age + days_market_absent
          )
      )
    , modify_at(.SD, c("termination_date", "reactivation_date"), \(x) as.Date(x, origin = "1970-01-01"))
  )}

(\(x, i, by){
  i <- define(x[i, on = by, allow.cartesian = TRUE]) ;
  imap(.ndc_events_meta, \(x, y){
    rlang::inject(descr(x = modify_at(i, y, \(j) as.numeric(j, units = "days")), var = !!rlang::sym(y), transpose = !TRUE)) |> 
      view(method = "render", table.classes = 'multi_stat', custom.css = "markdown.css") |>
      tags$td()
  })
})(master_drug_data, ndc_events_clean, c("alt_ndc", "fda_application_number", "market_date", "termination_date", "reactivation_date", "fda_approval_date")) |>
tags$tr() |>
tags$table()

Descriptive Statistics

days_to_market

N: 1418180
days_to_
market
Mean 785.99
Std.Dev 1849.98
Min -9859.00
Q1 0.00
Median 59.00
Q3 378.00
Max 11870.00
MAD 87.47
IQR 378.00
CV 2.35
Skewness 3.00
SE.Skewness 0.00
Kurtosis 9.41
N.Valid 1418180
Pct.Valid 100.00

Generated by summarytools 1.0.1 (R version 4.1.3)
2023-05-27

Descriptive Statistics

on_market_age

N: 1418180
on_market_
age
Mean 5297.41
Std.Dev 3320.22
Min 1.00
Q1 2748.00
Median 4397.00
Q3 7311.00
Max 11927.00
MAD 3034.88
IQR 4563.00
CV 0.63
Skewness 0.71
SE.Skewness 0.00
Kurtosis -0.58
N.Valid 1418180
Pct.Valid 100.00

Generated by summarytools 1.0.1 (R version 4.1.3)
2023-05-27

Descriptive Statistics

days_market_absent

N: 1418180
days_market_
absent
Mean 159.86
Std.Dev 636.39
Min -2999.00
Q1 0.00
Median 0.00
Q3 0.00
Max 11104.00
MAD 0.00
IQR 0.00
CV 3.98
Skewness 4.88
SE.Skewness 0.00
Kurtosis 30.43
N.Valid 1418180
Pct.Valid 100.00

Generated by summarytools 1.0.1 (R version 4.1.3)
2023-05-27

Some of the ‘Max’/‘Min’ values are negative; however, the relative number of records is small and more importantly, explainable:

  • days_to_market: Approval occurred after the market date
  • days_market_absent: Records where the termination date was non-NA but after the market date

NDC Events
Visualization

Combining the master drug data and event data (master_drug_data + ndc_events_clean), after some trial-and-error, I settled on the following showing the root-mean-square of metric values grouped by drug route: